STAT 301

Midterm Review

Statistical Modelling

3 models under a common framework

| Model    | Response   | Distribution | Parameter              | Link     |
|----------|------------|--------------|------------------------|----------|
| MLR      | continuous | Normal       | mean of response       | identity |
| Logistic | binary     | Bernoulli    | probability of success | log-odds |
| Poisson  | counts     | Poisson      | mean count             | log      |

  • mean(house price) = \(\beta_0\) + \(\beta_1\)size + \(\beta_2\)basementY

  • logit(prob. of default) = \(\beta_0\) + \(\beta_1\)balance + \(\beta_2\)income + \(\beta_3\)studentY

  • log(mean(number of bikes)) = \(\beta_0\) + \(\beta_1\)temperature
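A sketch of how each link function connects the linear predictor to the modelled parameter. All coefficient values below are made up for illustration; they are not fitted estimates:

```python
import math

# Hypothetical coefficient values, chosen only for illustration
def mlr_mean(size, basementY, b0=50.0, b1=0.3, b2=20.0):
    # identity link: the linear predictor is the mean response itself
    return b0 + b1 * size + b2 * basementY

def logistic_prob(balance, b0=-10.0, b1=0.0055):
    # log-odds link: invert with the logistic function to get a probability
    eta = b0 + b1 * balance
    return 1.0 / (1.0 + math.exp(-eta))

def poisson_mean(temperature, b0=4.0, b1=0.062):
    # log link: invert with exp to get a (positive) mean count
    return math.exp(b0 + b1 * temperature)
```

Whatever the coefficients, the logistic inverse link always returns a value in (0, 1) and the exponential always returns a positive mean, which is why these links are natural choices for probabilities and counts.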

These functions are used to interpret the regression coefficients!!!

Interpretation

Model continuous covariate
| Model | Interpretation (continuous covariate) |
|-------|---------------------------------------|
| MLR | an increase of 1 unit in the covariate is associated with an estimated increase/decrease of \(|\hat{\beta}_1|\) in the mean response |
| Logistic | an increase of 1 unit in the covariate is associated with an estimated increase/decrease in the log odds of success by \(|\hat{\beta}_1|\), or a change in the odds of success by a factor of \(e^{\hat{\beta}_1}\) |
| Poisson | an increase of 1 unit in the covariate is associated with an estimated increase/decrease in the log average counts by \(|\hat{\beta}_1|\), or a change in the mean counts by a factor of \(e^{\hat{\beta}_1}\) |
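Where the factor-of-\(e^{\hat{\beta}_1}\) interpretation comes from (logistic case): exponentiate the log-odds at \(x+1\) and at \(x\) and take the ratio,

\[
\frac{\text{odds}(x+1)}{\text{odds}(x)} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}
\]

The same argument applied to \(\log(\text{mean count})\) gives the factor-of-\(e^{\beta_1}\) interpretation for the Poisson model.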

Interpretation: examples

  • Logistic: a dollar increase in balance is associated with
    • an increase in the estimated log odds of default by 0.0055, or
    • an increase in the estimated odds of default by a factor of 1.0055,
    • or an increase in the estimated odds of default by 0.55% ((1.0055 - 1) \(\times\) 100)
  • Poisson: a Celsius degree increase in temperature is associated with
    • an increase in the estimated log mean number of bikes rented of 0.62, or
    • an increase in the estimated mean number of bikes by a factor of 1.86 (\(e^{0.62}\)),
    • or an increase in the estimated mean number of bikes of 86% ((1.86-1) \(\times\) 100)
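The arithmetic in the two examples above can be checked directly (a quick sketch in Python):

```python
import math

# Reproduce the arithmetic from the two examples above
odds_factor = math.exp(0.0055)          # logistic: balance coefficient
mean_factor = math.exp(0.62)            # Poisson: temperature coefficient
pct_change  = (mean_factor - 1) * 100   # percent change in mean counts

print(round(odds_factor, 4))  # → 1.0055
print(round(mean_factor, 2))  # → 1.86
print(round(pct_change))      # → 86
```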

Key words

  • estimate or estimated (these are not population quantities, they depend on the sample)

  • associated with (if the data come from an observational study, causation cannot be established)

  • “by a factor of” or “times” or “percent” if estimated coefficients are exponentiated

  • “holding other variables constant at any value” if the model is additive and has more variables, otherwise don’t!

  • check units!!

Additive models vs interactions

  • for additive models: coefficients are interpreted holding other variables constant at any value

  • for models with interactions: interpretations depend on levels or values of other variables

Warning

Interaction terms do not model correlations between covariates!!

We use interactions when the association between a covariate and the response depends on another covariate(s)
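A minimal illustration with two covariates: in the interaction model

\[
\text{mean}(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 = \beta_0 + \beta_2 x_2 + (\beta_1 + \beta_3 x_2)\, x_1
\]

the association between \(x_1\) and the mean response is \(\beta_1 + \beta_3 x_2\), which changes with the value of \(x_2\); there is no single "holding \(x_2\) constant at any value" interpretation.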

Example of additive MLR

  • for a given amount of spending on TV and newspaper advertising, spending an additional $1000 on radio advertising is associated with an estimated increase in sales by approximately $189

or “keeping the spending on TV and newspaper advertising constant at any value”

Table from ISLR

Hypothesis tests and CI

We need the sampling distribution (distribution of the estimators of the regression coefficients, \(\hat{\beta}_j\))

For the 3 models, we usually use a Normal approximation of the sampling distribution (details beyond this course)

  • for MLR: we usually need to estimate the standard deviation of the error term (\(\sigma\)), and the sampling distribution becomes a Student's \(t\) distribution
    • we estimate \(\sigma\) using the RMSE, given by glance()
  • for Logistic and Poisson: the variance of the response depends on its mean, so we don’t need to estimate it separately; thus we can use a Normal approximation as the sampling distribution
    • recall that sometimes we observe overdispersion (or underdispersion): more (or less) variability than the assumed mean-variance relationship implies
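A sketch of the resulting Wald test of \(H_0: \beta_j = 0\) under the Normal approximation. The estimate and standard error below are made up for illustration, not output from a real fit:

```python
import math

def wald_test(beta_hat, se):
    """Two-sided Wald test of H0: beta_j = 0 using the Normal
    approximation of the sampling distribution."""
    z = beta_hat / se
    # standard Normal CDF via the error function (stdlib only)
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p_value = 2.0 * (1.0 - phi)
    return z, p_value

# Illustrative (made-up) estimate and standard error
z, p = wald_test(0.0055, 0.00022)   # large |z|  -> tiny p-value
z2, p2 = wald_test(0.1, 0.2)        # small |z|  -> large p-value
```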

Interpretations of tests

The null hypothesis: \(H_0: \beta_j = 0\)

  • check the significance level

  • check the alternative hypothesis

  • the interpretation depends on the meaning of the coefficient

We have statistically significant evidence (p-value < .001) that the mean number of fires is positively associated with temperature.

  • Remember that we never know the true population parameters! At a given significance level:

    • we have evidence to reject \(H_0\). It doesn’t mean \(H_1\) is true!

    or

    • we don’t have enough evidence to reject (we fail to reject) \(H_0\). It doesn’t mean \(H_0\) is true!

CIs of coefficients

Recall that once the data is collected, the intervals are not random and we interpret them in terms of confidence, not probabilities!

We are 95% confident that each additional degree in temperature is associated with an increase in the mean number of fires between 5% and 10%.

We can use bootstrapping to compute CIs
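A minimal percentile-bootstrap sketch for the slope of a simple linear regression. Everything here, including the synthetic data, is made up for illustration:

```python
import random

def slope(xs, ys):
    """Least-squares slope for simple linear regression (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def bootstrap_ci(xs, ys, level=0.95, B=2000, seed=301):
    """Percentile bootstrap CI for the slope: resample (x, y) pairs
    with replacement, refit, then take empirical quantiles."""
    rng = random.Random(seed)
    n = len(xs)
    boots = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        boots.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
    boots.sort()
    alpha = (1 - level) / 2
    return boots[int(B * alpha)], boots[int(B * (1 - alpha)) - 1]

# Synthetic data with true slope 2 and small noise
data_rng = random.Random(1)
xs = list(range(40))
ys = [2 * x + data_rng.gauss(0, 0.5) for x in xs]
lo, hi = bootstrap_ci(xs, ys)   # interval close to the true slope of 2
```

Note that resampling whole (x, y) pairs (rather than residuals) makes no homoscedasticity assumption, which is one reason bootstrapping is attractive here.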

Assumptions and Diagnosis

In worksheet_03 and tutorial_03 we used simulations to study the relevance of assumptions made:

  • Normality: only for MLR; not needed for estimation or for large-sample approximations, but the linear model will be a good fit if the assumption holds.

  • Homoscedasticity: only for MLR, the spread (or variance) of the errors in a model is the same across all levels of the covariate(s).

  • Confounding Factor: a variable related to both a covariate and the response; it can make it look like there’s an association between them even when there isn’t.

  • Multicollinearity: correlation between covariates.

  • Independence: we assume that observations are independent of each other
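The confounding idea above can be demonstrated with a tiny simulation (hypothetical data: the confounder z drives both x and y, while x has no direct effect on y):

```python
def ls_slope(xs, ys):
    """Least-squares slope for simple linear regression (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

z = list(range(20))                                  # confounder
x = [zi + (1 if i % 2 == 0 else -1)                  # x depends on z
     for i, zi in enumerate(z)]
y = [3 * zi for zi in z]                             # y depends ONLY on z

b1 = ls_slope(x, y)   # close to 3 even though x does not affect y
```

Regressing y on x alone shows a strong association entirely induced by z; including z in the model would remove it.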

Predictions and Residuals

  • Predictions are values of the response computed with the estimated model for fixed values of the covariates:

    • fitted: common name for in-sample predictions
  • A residual is the difference between the observed response and the fitted value

  • Fitted values and residuals can be computed for the 3 models: MLR, Logistic and Poisson

  • However: recall that different quantities can be predicted with the logistic model (log-odds, odds, probabilities) and the Poisson model (log-counts, counts)

  • In Logistic and Poisson, the variance of each observation depends on the covariates (not constant), so residuals are adjusted (e.g., Pearson, deviance)

  • See tutorial_04 and tutorial_05
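For reference, the Pearson adjustment divides the raw residual by the standard deviation the model implies for that observation; a sketch for the Poisson and logistic cases:

```python
import math

def pearson_residual_poisson(y, mu):
    """Pearson residual for a Poisson model: the Poisson variance
    equals its mean, so the model-implied SD is sqrt(mu)."""
    return (y - mu) / math.sqrt(mu)

def pearson_residual_logistic(y, p):
    """Pearson residual for a logistic model: the Bernoulli
    variance is p(1 - p)."""
    return (y - p) / math.sqrt(p * (1 - p))
```

Because the variance changes with the covariates, raw residuals alone are hard to compare across observations; the scaled versions put them on a common footing.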